Towards information system development for data extraction from web
نویسندگان
چکیده
منابع مشابه
Towards Automatic Structured Web Data Extraction System
Automatic extraction of structured data from web pages is one of the key challenges for the Web search engines to advance into the more expressive semantic level. Here we propose a novel data extraction method, called ClustVX. It exploits visual as well as structural features of web page elements to group them into semantically similar clusters. Resulting clusters reflect the page structure and...
متن کاملTowards Semantic Web Information Extraction
The approach towards Semantic Web Information Extraction (IE) presented here is implemented in KIM – a platform for semantic indexing, annotation, and retrieval. It combines IE based on the mature text engineering platform (GATE1) with Semantic Web-compliant knowledge representation and management. The cornerstone is automatic generation of named-entity (NE) annotations with class and instance ...
متن کاملA Data Model for Information Extraction from the Web
The enormous amount of information available through the World Wide Web requires the development of effective tools for extracting and summarizing relevant data from Web sources. In this article we present a data model for representing Web documents and an associated SQL-like query language. Our framework provides an easy-to-use and wellformalized method for automatic generation of wrappers ext...
متن کاملRoadRunner: Towards Automatic Data Extraction from Large Web Sites
The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibil...
متن کاملInformation Extraction from the Web
The goal of information extraction from the Web is to provide an integrated view on data from autonomous heterogeneous information sources The main problem with current wrap per mediator approaches is that they rely on very di erent formalisms and tools for wrappers and mediators thus leading to an impedance mismatch between the wrapper and mediator level Additionally most approaches nowadays a...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Bulletin of National Technical University "KhPI". Series: System Analysis, Control and Information Technologies
سال: 2018
ISSN: 2410-2857,2079-0023
DOI: 10.20998/2079-0023.2018.22.08